tutorials/039 - Athena Iceberg.ipynb

{ "cells": [ { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "[![AWS SDK for pandas](_static/logo.png \"AWS SDK for pandas\")](https://github.com/aws/aws-sdk-pandas)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "# 39 - Athena Iceberg" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } }, "source": [ "Athena supports read, time travel, write, and DDL queries for Apache Iceberg tables that use the Apache Parquet format for data and the AWS Glue catalog for their metastore. More in [User Guide](https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg.html)." ] }, { "cell_type": "markdown", "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } }, "source": [ "### Create Iceberg table" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "import getpass\n", "\n", "bucket_name = getpass.getpass()" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": "True" }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import awswrangler as wr\n", "\n", "glue_database = \"aws_sdk_pandas\"\n", "glue_table = \"iceberg_test\"\n", "path = f\"s3://{bucket_name}/iceberg_test/\"\n", "temp_path = f\"s3://{bucket_name}/iceberg_test_temp/\"\n", "\n", "# Cleanup table before create\n", "wr.catalog.delete_table_if_exists(database=glue_database, table=glue_table)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } }, "source": [ "### Create table & insert data" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } }, "source": [ "It is possible to insert Pandas data frame into Iceberg table using `wr.athena.to_iceberg`. If the table does not exist, it will be created:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "df = pd.DataFrame({\"id\": [1, 2, 3], \"name\": [\"John\", \"Lily\", \"Richard\"]})\n", "\n", "wr.athena.to_iceberg(\n", " df=df,\n", " database=glue_database,\n", " table=glue_table,\n", " table_location=path,\n", " temp_path=temp_path,\n", ")" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } }, "source": [ "Alternatively, it is also possible to insert by directly running `INSERT INTO ... VALUES`:" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "{'QueryExecutionId': 'e339fcd2-9db1-43ac-bb9e-9730e6395b51',\n", " 'Query': \"INSERT INTO iceberg_test VALUES (1,'John'), (2, 'Lily'), (3, 'Richard')\",\n", " 'StatementType': 'DML',\n", " 'ResultConfiguration': {'OutputLocation': 's3://aws-athena-query-results-...-us-east-1/e339fcd2-9db1-43ac-bb9e-9730e6395b51'},\n", " 'ResultReuseConfiguration': {'ResultReuseByAgeConfiguration': {'Enabled': False}},\n", " 'QueryExecutionContext': {'Database': 'aws_sdk_pandas'},\n", " 'Status': {'State': 'SUCCEEDED',\n", " 'SubmissionDateTime': datetime.datetime(2023, 3, 16, 10, 40, 8, 612000, tzinfo=tzlocal()),\n", " 'CompletionDateTime': datetime.datetime(2023, 3, 16, 10, 40, 11, 143000, tzinfo=tzlocal())},\n", " 'Statistics': {'EngineExecutionTimeInMillis': 2242,\n", " 'DataScannedInBytes': 0,\n", " 'DataManifestLocation': 's3://aws-athena-query-results-...-us-east-1/e339fcd2-9db1-43ac-bb9e-9730e6395b51-manifest.csv',\n", " 'TotalExecutionTimeInMillis': 2531,\n", " 'QueryQueueTimeInMillis': 241,\n", " 'QueryPlanningTimeInMillis': 179,\n", " 'ServiceProcessingTimeInMillis': 48,\n", " 'ResultReuseInformation': {'ReusedPreviousResult': False}},\n", " 'WorkGroup': 'primary',\n", " 'EngineVersion': {'SelectedEngineVersion': 'Athena engine version 3',\n", " 'EffectiveEngineVersion': 'Athena engine version 3'}}" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wr.athena.start_query_execution(\n", " sql=f\"INSERT INTO {glue_table} VALUES (1,'John'), (2, 'Lily'), (3, 'Richard')\",\n", " database=glue_database,\n", " wait=True,\n", ")" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "{'QueryExecutionId': '922c8f02-4c00-4050-b4a7-7016809efa2b',\n", " 'Query': \"INSERT INTO iceberg_test VALUES (4,'Anne'), (5, 'Jacob'), (6, 'Leon')\",\n", " 'StatementType': 'DML',\n", " 'ResultConfiguration': {'OutputLocation': 's3://aws-athena-query-results-...-us-east-1/922c8f02-4c00-4050-b4a7-7016809efa2b'},\n", " 'ResultReuseConfiguration': {'ResultReuseByAgeConfiguration': {'Enabled': False}},\n", " 'QueryExecutionContext': {'Database': 'aws_sdk_pandas'},\n", " 'Status': {'State': 'SUCCEEDED',\n", " 'SubmissionDateTime': datetime.datetime(2023, 3, 16, 10, 40, 24, 582000, tzinfo=tzlocal()),\n", " 'CompletionDateTime': datetime.datetime(2023, 3, 16, 10, 40, 27, 352000, tzinfo=tzlocal())},\n", " 'Statistics': {'EngineExecutionTimeInMillis': 2414,\n", " 'DataScannedInBytes': 0,\n", " 'DataManifestLocation': 's3://aws-athena-query-results-...-us-east-1/922c8f02-4c00-4050-b4a7-7016809efa2b-manifest.csv',\n", " 'TotalExecutionTimeInMillis': 2770,\n", " 'QueryQueueTimeInMillis': 329,\n", " 'QueryPlanningTimeInMillis': 189,\n", " 'ServiceProcessingTimeInMillis': 27,\n", " 'ResultReuseInformation': {'ReusedPreviousResult': False}},\n", " 'WorkGroup': 'primary',\n", " 'EngineVersion': {'SelectedEngineVersion': 'Athena engine version 3',\n", " 'EffectiveEngineVersion': 'Athena engine version 3'}}" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wr.athena.start_query_execution(\n", " sql=f\"INSERT INTO {glue_table} VALUES (4,'Anne'), (5, 'Jacob'), (6, 'Leon')\",\n", " database=glue_database,\n", " wait=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } }, "source": [ "### Query\n" ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>id</th>\n", " <th>name</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>John</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>4</td>\n", " <td>Anne</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2</td>\n", " <td>Lily</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>3</td>\n", " <td>Richard</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>5</td>\n", " <td>Jacob</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>6</td>\n", " <td>Leon</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " id name\n", "0 1 John\n", "1 4 Anne\n", "2 2 Lily\n", "3 3 Richard\n", "4 5 Jacob\n", "5 6 Leon" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wr.athena.read_sql_query(\n", " sql=f'SELECT * FROM \"{glue_table}\"',\n", " database=glue_database,\n", " ctas_approach=False,\n", " unload_approach=False,\n", ")" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } }, "source": [ "### Read query metadata" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } }, "source": [ "In a SELECT query, you can use the following properties after `table_name` to query Iceberg table metadata:" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } }, "source": [ "- `$files` Shows a table's current data files" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } }, "source": [ "- `$manifests` Shows a table's current file manifests" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } }, "source": [ "- `$history` Shows a table's history" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } }, "source": [ "- `$partitions` Shows a table's current partitions" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>content</th>\n", " <th>file_path</th>\n", " <th>file_format</th>\n", " <th>record_count</th>\n", " <th>file_size_in_bytes</th>\n", " <th>column_sizes</th>\n", " <th>value_counts</th>\n", " <th>null_value_counts</th>\n", " <th>nan_value_counts</th>\n", " <th>lower_bounds</th>\n", " <th>upper_bounds</th>\n", " <th>key_metadata</th>\n", " <th>split_offsets</th>\n", " <th>equality_ids</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0</td>\n", " <td>s3://.../iceberg_test/data/089a...</td>\n", " <td>PARQUET</td>\n", " <td>3</td>\n", " <td>360</td>\n", " <td>{1=48, 2=63}</td>\n", " <td>{1=3, 2=3}</td>\n", " <td>{1=0, 2=0}</td>\n", " <td>{}</td>\n", " <td>{1=1, 2=John}</td>\n", " <td>{1=3, 2=Richard}</td>\n", " <td><NA></td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>0</td>\n", " <td>s3://.../iceberg_test/data/5736...</td>\n", " <td>PARQUET</td>\n", " <td>3</td>\n", " <td>355</td>\n", " <td>{1=48, 2=61}</td>\n", " <td>{1=3, 2=3}</td>\n", " <td>{1=0, 2=0}</td>\n", " <td>{}</td>\n", " <td>{1=4, 2=Anne}</td>\n", " <td>{1=6, 2=Leon}</td>\n", " <td><NA></td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " content file_path file_format \\\n", "0 0 s3://.../iceberg_test/data/089a... PARQUET \n", "1 0 s3://.../iceberg_test/data/5736... PARQUET \n", "\n", " record_count file_size_in_bytes column_sizes value_counts \\\n", "0 3 360 {1=48, 2=63} {1=3, 2=3} \n", "1 3 355 {1=48, 2=61} {1=3, 2=3} \n", "\n", " null_value_counts nan_value_counts lower_bounds upper_bounds \\\n", "0 {1=0, 2=0} {} {1=1, 2=John} {1=3, 2=Richard} \n", "1 {1=0, 2=0} {} {1=4, 2=Anne} {1=6, 2=Leon} \n", "\n", " key_metadata split_offsets equality_ids \n", "0 <NA> NaN NaN \n", "1 <NA> NaN NaN " ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wr.athena.read_sql_query(\n", " sql=f'SELECT * FROM \"{glue_table}$files\"',\n", " database=glue_database,\n", " ctas_approach=False,\n", " unload_approach=False,\n", ")" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>path</th>\n", " <th>length</th>\n", " <th>partition_spec_id</th>\n", " <th>added_snapshot_id</th>\n", " <th>added_data_files_count</th>\n", " <th>added_rows_count</th>\n", " <th>existing_data_files_count</th>\n", " <th>existing_rows_count</th>\n", " <th>deleted_data_files_count</th>\n", " <th>deleted_rows_count</th>\n", " <th>partitions</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>s3://.../iceberg_test/metadata/...</td>\n", " <td>6538</td>\n", " <td>0</td>\n", " <td>4379263637983206651</td>\n", " <td>1</td>\n", " <td>3</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>[]</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>s3://.../iceberg_test/metadata/...</td>\n", " <td>6548</td>\n", " <td>0</td>\n", " <td>2934717851675145063</td>\n", " <td>1</td>\n", " <td>3</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>[]</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " path length \\\n", "0 s3://.../iceberg_test/metadata/... 6538 \n", "1 s3://.../iceberg_test/metadata/... 6548 \n", "\n", " partition_spec_id added_snapshot_id added_data_files_count \\\n", "0 0 4379263637983206651 1 \n", "1 0 2934717851675145063 1 \n", "\n", " added_rows_count existing_data_files_count existing_rows_count \\\n", "0 3 0 0 \n", "1 3 0 0 \n", "\n", " deleted_data_files_count deleted_rows_count partitions \n", "0 0 0 [] \n", "1 0 0 [] " ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wr.athena.read_sql_query(\n", " sql=f'SELECT * FROM \"{glue_table}$manifests\"',\n", " database=glue_database,\n", " ctas_approach=False,\n", " unload_approach=False,\n", ")" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>made_current_at</th>\n", " <th>snapshot_id</th>\n", " <th>parent_id</th>\n", " <th>is_current_ancestor</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>2023-03-16 09:40:10.438000+00:00</td>\n", " <td>2934717851675145063</td>\n", " <td><NA></td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2023-03-16 09:40:26.754000+00:00</td>\n", " <td>4379263637983206651</td>\n", " <td>2934717851675144704</td>\n", " <td>True</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " made_current_at snapshot_id parent_id \\\n", "0 2023-03-16 09:40:10.438000+00:00 2934717851675145063 <NA> \n", "1 2023-03-16 09:40:26.754000+00:00 4379263637983206651 2934717851675144704 \n", "\n", " is_current_ancestor \n", "0 True \n", "1 True " ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = wr.athena.read_sql_query(\n", " sql=f'SELECT * FROM \"{glue_table}$history\"',\n", " database=glue_database,\n", " ctas_approach=False,\n", " unload_approach=False,\n", ")\n", "\n", "# Save snapshot id\n", "snapshot_id = df.snapshot_id[0]\n", "\n", "df" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>record_count</th>\n", " <th>file_count</th>\n", " <th>total_size</th>\n", " <th>data</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>6</td>\n", " <td>2</td>\n", " <td>715</td>\n", " <td>{id={min=1, max=6, null_count=0, nan_count=nul...</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " record_count file_count total_size \\\n", "0 6 2 715 \n", "\n", " data \n", "0 {id={min=1, max=6, null_count=0, nan_count=nul... " ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wr.athena.read_sql_query(\n", " sql=f'SELECT * FROM \"{glue_table}$partitions\"',\n", " database=glue_database,\n", " ctas_approach=False,\n", " unload_approach=False,\n", ")" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } }, "source": [ "### Time travel" ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>id</th>\n", " <th>name</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>John</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>4</td>\n", " <td>Anne</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2</td>\n", " <td>Lily</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>3</td>\n", " <td>Richard</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>5</td>\n", " <td>Jacob</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>6</td>\n", " <td>Leon</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " id name\n", "0 1 John\n", "1 4 Anne\n", "2 2 Lily\n", "3 3 Richard\n", "4 5 Jacob\n", "5 6 Leon" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wr.athena.read_sql_query(\n", " sql=f\"SELECT * FROM {glue_table} FOR TIMESTAMP AS OF (current_timestamp - interval '5' second)\",\n", " database=glue_database,\n", ")" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } }, "source": [ "### Version travel\n" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>id</th>\n", " <th>name</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>John</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2</td>\n", " <td>Lily</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>3</td>\n", " <td>Richard</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " id name\n", "0 1 John\n", "1 2 Lily\n", "2 3 Richard" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wr.athena.read_sql_query(\n", " sql=f\"SELECT * FROM {glue_table} FOR VERSION AS OF {snapshot_id}\",\n", " database=glue_database,\n", ")" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } }, "source": [ "### Optimize" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } }, "source": [ "The `OPTIMIZE table REWRITE DATA` compaction action rewrites data files into a more optimized layout based on their size and number of associated delete files. For syntax and table property details, see [OPTIMIZE](https://docs.aws.amazon.com/athena/latest/ug/optimize-statement.html).\n", "\n" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "{'QueryExecutionId': '94666790-03ae-42d7-850a-fae99fa79a68',\n", " 'Query': 'OPTIMIZE iceberg_test REWRITE DATA USING BIN_PACK',\n", " 'StatementType': 'DDL',\n", " 'ResultConfiguration': {'OutputLocation': 's3://aws-athena-query-results-...-us-east-1/tables/94666790-03ae-42d7-850a-fae99fa79a68'},\n", " 'ResultReuseConfiguration': {'ResultReuseByAgeConfiguration': {'Enabled': False}},\n", " 'QueryExecutionContext': {'Database': 'aws_sdk_pandas'},\n", " 'Status': {'State': 'SUCCEEDED',\n", " 'SubmissionDateTime': datetime.datetime(2023, 3, 16, 10, 49, 42, 857000, tzinfo=tzlocal()),\n", " 'CompletionDateTime': datetime.datetime(2023, 3, 16, 10, 49, 45, 655000, tzinfo=tzlocal())},\n", " 'Statistics': {'EngineExecutionTimeInMillis': 2622,\n", " 'DataScannedInBytes': 220,\n", " 'DataManifestLocation': 's3://aws-athena-query-results-...-us-east-1/tables/94666790-03ae-42d7-850a-fae99fa79a68-manifest.csv',\n", " 'TotalExecutionTimeInMillis': 2798,\n", " 'QueryQueueTimeInMillis': 124,\n", " 'QueryPlanningTimeInMillis': 252,\n", " 'ServiceProcessingTimeInMillis': 52,\n", " 'ResultReuseInformation': {'ReusedPreviousResult': False}},\n", " 'WorkGroup': 'primary',\n", " 'EngineVersion': {'SelectedEngineVersion': 'Athena engine version 3',\n", " 'EffectiveEngineVersion': 'Athena engine version 3'}}" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wr.athena.start_query_execution(\n", " sql=f\"OPTIMIZE {glue_table} REWRITE DATA USING BIN_PACK\",\n", " database=glue_database,\n", " wait=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } }, "source": [ "### Vacuum\n" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } }, "source": [ "`VACUUM` performs [snapshot expiration](https://iceberg.apache.org/docs/latest/spark-procedures/#expire_snapshots) and [orphan file removal](https://iceberg.apache.org/docs/latest/spark-procedures/#remove_orphan_files). These actions reduce metadata size and remove files not in the current table state that are also older than the retention period specified for the table. For syntax details, see [VACUUM](https://docs.aws.amazon.com/athena/latest/ug/vacuum-statement.html)." ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "{'QueryExecutionId': '717a7de6-b873-49c7-b744-1b0b402f24c9',\n", " 'Query': 'VACUUM iceberg_test',\n", " 'StatementType': 'DML',\n", " 'ResultConfiguration': {'OutputLocation': 's3://aws-athena-query-results-...-us-east-1/717a7de6-b873-49c7-b744-1b0b402f24c9.csv'},\n", " 'ResultReuseConfiguration': {'ResultReuseByAgeConfiguration': {'Enabled': False}},\n", " 'QueryExecutionContext': {'Database': 'aws_sdk_pandas'},\n", " 'Status': {'State': 'SUCCEEDED',\n", " 'SubmissionDateTime': datetime.datetime(2023, 3, 16, 10, 50, 41, 14000, tzinfo=tzlocal()),\n", " 'CompletionDateTime': datetime.datetime(2023, 3, 16, 10, 50, 43, 441000, tzinfo=tzlocal())},\n", " 'Statistics': {'EngineExecutionTimeInMillis': 2229,\n", " 'DataScannedInBytes': 0,\n", " 'TotalExecutionTimeInMillis': 2427,\n", " 'QueryQueueTimeInMillis': 153,\n", " 'QueryPlanningTimeInMillis': 30,\n", " 'ServiceProcessingTimeInMillis': 45,\n", " 'ResultReuseInformation': {'ReusedPreviousResult': False}},\n", " 'WorkGroup': 'primary',\n", " 'EngineVersion': {'SelectedEngineVersion': 'Athena engine version 3',\n", " 'EffectiveEngineVersion': 'Athena engine version 3'}}" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wr.athena.start_query_execution(\n", " sql=f\"VACUUM {glue_table}\",\n", " database=glue_database,\n", " wait=True,\n", ")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.9.14", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.14" } }, "nbformat": 4, "nbformat_minor": 4 }

tutorials/039 - Athena Iceberg.ipynb (1,170 lines of code) (raw):